Universal Character Set characters
The Unicode Consortium (UC) and the International Organization for Standardization (ISO) collaborate on the Universal Character Set (UCS). The UCS is an international standard that maps characters used in natural language (as opposed to programming languages, for instance) to numeric, machine-readable values. By creating this mapping, the UCS enables computer software vendors to interoperate and transmit UCS-encoded text strings from one to another.
ISO maintains the basic mapping of characters from character name to code point. The terms character and code point are often used interchangeably. However, when a distinction is made, a code point refers to the integer assigned to the character: what one might think of as its address. While a character in ISO/IEC 10646 consists of the combination of the code point and its name, Unicode adds many other properties to the character set. Together, these properties further define each character.
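The distinction between a code point, a character name, and the additional Unicode properties can be explored with Python's standard `unicodedata` module. The following is only a minimal sketch; the helper name `describe` is hypothetical, and the values returned depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return the code point, name, and one Unicode property of a character."""
    return {
        "code_point": ord(ch),                  # the integer "address"
        "name": unicodedata.name(ch),           # the character name
        "category": unicodedata.category(ch),   # a property added by Unicode
    }


info = describe("A")
print(f"U+{info['code_point']:04X} {info['name']} ({info['category']})")

# The mapping also works in the other direction, from name to character.
print(unicodedata.lookup("LATIN CAPITAL LETTER A"))
```

Here `category` is just one of the many properties Unicode attaches to each character; "Lu" stands for an uppercase letter.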
In addition to the UCS, Unicode also provides other implementation details such as:
- transcoding mappings between UCS and other character sets
- different collations of characters and character strings for different languages
- an algorithm for laying out bidirectional text, where text on the same line may shift between left-to-right and right-to-left
- a case folding algorithm
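Two of the items in this list can be glimpsed from Python's standard library. The sketch below is illustrative only: `str.casefold` applies Unicode case folding, and `unicodedata.bidirectional` returns the bidirectional class that the layout algorithm takes as input (it does not perform the layout itself). The helper names `same_ignoring_case` and `bidi_class` are hypothetical.

```python
import unicodedata


def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings after Unicode case folding."""
    return a.casefold() == b.casefold()


def bidi_class(ch: str) -> str:
    """Return the bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)


# German sharp s folds to "ss", so these two spellings compare equal.
print(same_ignoring_case("Straße", "STRASSE"))

# A Latin letter, a Hebrew letter, and an Arabic letter have different classes.
for ch in ("a", "\u05d0", "\u0628"):
    print(f"U+{ord(ch):04X}", bidi_class(ch))
```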
Computer software end users enter these characters into programs through various input methods. Input methods can be through keyboard or a graphical character palette.
Divisions of UCS
The UCS can be divided in various ways: by plane, category, block, and so on. Unicode and ISO divide it into 17 planes of 65,536 code points each, 1,114,112 code points in total. Because the last two code points of every plane are reserved as non-characters, each plane can hold at most 65,534 distinct characters, or 1,114,078 across all planes. As of 2007 (Unicode 5.0), ISO and the Unicode Consortium have allocated characters and blocks in only six of the 17 planes. The others remain empty and reserved for future use.
- Basic Multilingual Plane (BMP). This plane contains most of the characters needed for scripts and languages in routine use in the world today. The plane is nearly filled, with only approximately 3,700 of the 65,534 code points remaining to be defined.
- Supplementary Multilingual Plane (SMP). Currently used for many ancient scripts and characters as well as musical and mathematical notation.
- Supplementary Ideographic Plane (SIP). Used for ideographic characters used in many languages in China, Japan, Korea, Taiwan, Vietnam and Singapore.
- Supplementary Special-purpose Plane (SSP). For special-purpose characters such as compatibility control characters.
- Private Use Plane A. Together the Private Use planes provide 131,068 characters — in addition to the 6,400 private use code points provided in the BMP — for definition by organizations outside Unicode and ISO 10646. Such private use definers might be operating system vendors, font vendors, or other independent standards organizations.
- Private Use Plane B.
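Because each plane spans 65,536 (0x10000) code points, the plane of any code point is simply its value shifted right by 16 bits. The sketch below illustrates that arithmetic. The plane numbers in `PLANE_NAMES` (0, 1, 2, 14, 15, 16) come from the Unicode Standard rather than from the text above, and the function names are hypothetical.

```python
PLANE_SIZE = 0x10000                            # 65,536 code points per plane
PLANE_COUNT = 17
MAX_CODE_POINT = PLANE_SIZE * PLANE_COUNT - 1   # 0x10FFFF

# Plane numbers for the six allocated planes (assumption: Unicode 5.0 layout).
PLANE_NAMES = {
    0: "Basic Multilingual Plane",
    1: "Supplementary Multilingual Plane",
    2: "Supplementary Ideographic Plane",
    14: "Supplementary Special-purpose Plane",
    15: "Private Use Plane A",
    16: "Private Use Plane B",
}


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains the code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a UCS code point: {code_point:#x}")
    return code_point >> 16


def plane_name(code_point: int) -> str:
    """Return the plane's name, or a placeholder for an empty plane."""
    return PLANE_NAMES.get(plane_of(code_point), "unallocated plane")


print(PLANE_SIZE * PLANE_COUNT)   # total code points in the UCS
print(plane_name(0x1D11E))        # MUSICAL SYMBOL G CLEF lies outside the BMP
```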
By block
Unicode adds a block property to UCS that further divides each plane into separate blocks. Each block is a grouping of characters by their use such as "mathematical operators" or "Hebrew script characters". When assigning characters to previously unassigned code points, the Consortium typically allocates entire blocks of similar characters: for example all the characters belonging to the same script or all similarly purposed symbols get assigned to a single block. Blocks may also maintain unassigned or reserved code points when the Consortium expects a block to require additional assignments.
By type
UCS may also be divided according to the types of characters: script, symbol, diacritic, punctuation and so on.
Types include:
- Modern Scripts. As of 2006 (Unicode 5.0), the UCS identifies approximately 50 scripts in current use throughout the world. Several more are in the early approval stages for future inclusion in the UCS.
- Ancient Scripts (Obsolete Scripts). UCS also includes many scripts no longer in use such as Linear B and Phoenician.
- International Phonetic Alphabet. The UCS devotes several blocks (over 300 characters) to characters for the International Phonetic Alphabet.
- Combining Diacritical Marks. An important advance conceived by Unicode in designing the UCS and related algorithms for handling text was the introduction of combining diacritic marks. By providing accents that can combine with any letter character, Unicode and the UCS significantly reduce the number of characters needed. While the UCS also includes precomposed characters, these were included primarily to facilitate support within UCS for non-Unicode text processing systems.
- Punctuation. Along with unifying diacritics, the UCS also sought to unify punctuation across scripts. However, many scripts still contain their own punctuation where that punctuation has no similar semantics in other scripts.
- Symbols. Many mathematical, technical, geometrical and other symbols are included within the UCS. This provides distinct symbols with their own code point or character rather than relying on switching fonts to provide symbolic glyphs.
- Currency.
- Letterlike. These symbols look like combinations of common Latin-script letters, such as ℅. Unicode designates many of the letterlike symbols as compatibility characters, usually because they can be represented in plain text by substituting a glyph for a composed sequence of characters: for example, substituting the glyph ℅ for the composed sequence of characters c/o.
- Number Forms. Number forms primarily consist of precomposed fractions and Roman numerals. As with other composing sequences of characters, the Unicode approach prefers the flexibility of composing fractions by combining characters together. In this case, to create a fraction, one combines digits with the fraction slash character (U+2044). As an example of the flexibility this approach provides, the UCS includes only about a dozen precomposed fraction characters, yet there is an infinity of possible fractions, and no character set could include code points for every one of them. With composing characters, that infinity of fractions is handled by 11 characters (0-9 and the fraction slash). Ideally, a text system should present the same glyphs for a fraction whether it is one of the 12 precomposed fractions (such as ⅓) or a composing sequence of characters (such as 1⁄3). Doing so ensures that precomposed fractions and composed-sequence fractions appear compatible next to each other. However, web browsers are not typically that sophisticated in their Unicode and text handling.
- Arrows.
- Mathematical Operators and Other Symbols.
- Geometric Shapes.
- Control Pictures. Graphical representations of many control characters.
- Box Drawing.
- Block Elements.
- Braille Patterns.
- Optical Character Recognition.
- Technical.
- Dingbats.
- Other Miscellaneous Symbols.
- CJK. Devoted to ideographs and other characters to support languages in China, Japan, Korea (CJK), Taiwan, Vietnam, and Singapore.
- Radicals and Strokes.
- Ideographs. By far the largest portion of the UCS is devoted to ideographs used in languages of Eastern Asia. While the glyph representations of these ideographs have diverged in the languages that use them, the UCS unifies these Han characters in what Unicode refers to as Unihan (for Unified Han). With Unihan, the text layout software must work together with the available fonts and these Unicode characters to produce the appropriate glyph for the appropriate language. Despite unifying these characters, the UCS still includes over 80,000 Unihan ideographs.
- Musical Notation.
- Compatibility Characters. Several blocks in the UCS are devoted almost entirely to compatibility characters. Compatibility characters are those included for support of legacy text handling systems that do not make a distinction between character and glyph the way Unicode does. For example, many Arabic letters are represented by a different glyph when the letter appears at the end of a word than when the letter appears at the beginning of a word. Unicode's approach prefers to have these letters mapped to the same character for ease of internal machine text processing and storage. To complement this approach, the text software must select different glyph variants for display of the character based on its context. Over 4,000 characters are included for such compatibility reasons.
- Control Characters.
- Surrogates. The UCS includes 2,048 code points in the Basic Multilingual Plane (BMP) for surrogate code point pairs. Together these surrogates allow any code point in the sixteen other planes to be addressed by using two surrogate code points. This provides a simple built-in method for encoding the roughly 20.1-bit UCS within a 16-bit encoding such as UTF-16. In this way UTF-16 can represent any character within the BMP with a single 16-bit code unit. Characters outside the BMP are then encoded using two 16-bit code units (4 octets total) that form a surrogate pair.
- Private Use. The Consortium provides several private use blocks and planes that can be assigned characters within various communities, as well as by operating system and font vendors.
- Non-characters. The Consortium guarantees certain code points will never be assigned a character and calls these non-character code points. The last two code points of each plane (ending in FFFE and FFFF) are such code points. A further 32 non-characters (U+FDD0 to U+FDEF) lie within the Basic Multilingual Plane, the first plane.
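The Number Forms entry in the list above describes composing fractions with the fraction slash (U+2044). A minimal sketch of that idea follows, using Python's standard `unicodedata` module. The helper names are hypothetical, and whether the composed sequence is drawn as a proper fraction depends entirely on the font and rendering software.

```python
import unicodedata

FRACTION_SLASH = "\u2044"


def compose_fraction(numerator: int, denominator: int) -> str:
    """Build a fraction as digits joined by the fraction slash."""
    return f"{numerator}{FRACTION_SLASH}{denominator}"


def decompose_precomposed(ch: str) -> str:
    """Expand a precomposed fraction into its compatibility decomposition."""
    return unicodedata.normalize("NFKD", ch)


one_third = "\u2153"   # VULGAR FRACTION ONE THIRD, a precomposed character

# The precomposed character decomposes to the same composed sequence.
print(decompose_precomposed(one_third) == compose_fraction(1, 3))

# A fraction with no precomposed character can still be written.
print(compose_fraction(22, 7))
```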
Special code points
Among the more than one million code points available in UCS, many are set aside for other uses or for designation by third parties. These set-aside code points include non-character code points, surrogates, and private use code points.
Non-characters
Non-character code points are set aside and guaranteed never to be used for a character. Each of the 17 planes has its two ending code points set aside as non-characters. One of these, U+FFFE in the Basic Multilingual Plane, is the byte-swapped form of the byte order mark (U+FEFF). Encountering U+FFFE at the start of a text therefore indicates that the byte order of the text has been misinterpreted.
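A test for non-character code points can be sketched as follows. It assumes the Unicode Standard's definition: the last two code points of every plane (those whose low 16 bits are FFFE or FFFF) plus the range U+FDD0 to U+FDEF in the Basic Multilingual Plane. The function name `is_noncharacter` is hypothetical.

```python
def is_noncharacter(code_point: int) -> bool:
    """Return True if the code point is a guaranteed non-character."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a UCS code point: {code_point:#x}")
    # The 32 non-characters inside the Basic Multilingual Plane.
    if 0xFDD0 <= code_point <= 0xFDEF:
        return True
    # The last two code points of each plane: low 16 bits are FFFE or FFFF.
    return (code_point & 0xFFFE) == 0xFFFE


print(is_noncharacter(0xFFFE))    # byte-swapped form of the byte order mark
print(is_noncharacter(0xFEFF))    # the byte order mark itself is a character
print(sum(is_noncharacter(cp) for cp in range(0x110000)))  # 17 * 2 + 32
```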
Surrogates
The UCS uses surrogates to address characters outside the initial Basic Multilingual Plane without resorting to code units wider than 16 bits. By combining pairs of the 2,048 surrogate code points (1,024 high surrogates and 1,024 low surrogates), the remaining characters in all the other planes can be addressed (1,024 × 1,024 = 1,048,576 code points in the other 16 planes). In this way, UCS has a built-in 16-bit encoding capability for UTF-16.
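The pairing arithmetic can be sketched as below. The surrogate ranges used (high surrogates U+D800 to U+DBFF, low surrogates U+DC00 to U+DFFF) are taken from the UTF-16 definition rather than from the text above, and the function names are hypothetical.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point outside the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP use surrogates")
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D11E)     # MUSICAL SYMBOL G CLEF
print(f"{high:04X} {low:04X}")

# Python's own UTF-16 encoder produces the same two code units.
print(chr(0x1D11E).encode("utf-16-be").hex().upper())
```

Each half of the pair carries 10 bits, which is where the 1,024 × 1,024 figure comes from.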
Private use
The UCS guarantees it will never assign characters to the 137,468 private use code points. Operating system and font vendors and communities of end users may use these for their own agreed-on purposes.
Characters, grapheme clusters and glyphs
Whereas many other character sets assign a character for every possible glyph representation of the character, Unicode seeks to treat characters separately from glyphs. This distinction is not always unambiguous; however, a few examples will help illustrate it. Often two or more characters may be combined to typographically improve the readability of the text. For example, the three-letter sequence "ffi" may be treated as a single glyph. Other character sets would often assign a code point to this glyph in addition to the individual letters "f" and "i".
In addition, Unicode approaches diacritic-modified letters as separate characters that, when rendered, become a single glyph. For example, an "o" with diaeresis: "ö". Traditionally, other character sets assigned a unique character code point for each diacritic-modified letter used in each language. Unicode seeks to create a more flexible approach by allowing combining diacritic characters to combine with any letter. This has the potential to significantly reduce the number of active code points needed for the character set. As an example, consider a language that uses the Latin script and combines the diaeresis with the upper- and lower-case letters "a", "o", and "u". With the Unicode approach, only the diaeresis diacritic character needs to be added to the character set to use with the Latin letters "a", "A", "o", "O", "u", and "U": seven characters in all. A legacy character set needs to add six precomposed letters with a diaeresis in addition to the six code points it uses for the letters without diaeresis: twelve character code points in total.
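The private use total can be checked against the three private use ranges. The range boundaries below (U+E000 to U+F8FF in the BMP, and planes 15 and 16 minus their two closing non-characters) are taken from the Unicode Standard rather than from the text above; `is_private_use` is a hypothetical helper name.

```python
# (first, last) code point of each private use range, inclusive.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),        # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),      # Private Use Plane A (plane 15)
    (0x100000, 0x10FFFD),    # Private Use Plane B (plane 16)
]


def is_private_use(code_point: int) -> bool:
    """Return True if the code point is reserved for private use."""
    return any(first <= code_point <= last for first, last in PRIVATE_USE_RANGES)


sizes = [last - first + 1 for first, last in PRIVATE_USE_RANGES]
print(sizes)        # size of each range
print(sum(sizes))   # total private use code points
```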
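The diaeresis example can be reproduced with Python's standard `unicodedata` module. This is a sketch under the assumption that normalization form NFC is used to map a base letter plus COMBINING DIAERESIS (U+0308) to the equivalent precomposed letter, and NFKC to expand the "ffi" ligature character (U+FB03) into its three letters.

```python
import unicodedata

COMBINING_DIAERESIS = "\u0308"
BASE_LETTERS = "aAoOuU"

for letter in BASE_LETTERS:
    composed_sequence = letter + COMBINING_DIAERESIS      # two code points
    precomposed = unicodedata.normalize("NFC", composed_sequence)
    print(f"{letter} + U+0308 -> U+{ord(precomposed):04X} {precomposed}")

# Six base letters plus one combining mark: seven characters in all.
print(len(BASE_LETTERS) + 1)

# The ligature glyph has its own compatibility code point, which expands
# to the plain three-letter sequence.
print(unicodedata.normalize("NFKC", "\ufb03"))
```

Both spellings of "ö" are canonically equivalent, so text processing software is expected to treat them as the same text.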
Compatibility characters
UCS includes thousands of characters that Unicode designates as compatibility characters. These are characters that were included in UCS in order to provide distinct code points for characters that other character sets differentiate, but would not be differentiated in the Unicode approach to characters.
The chief reason for this differentiation was that Unicode makes a distinction between characters and glyphs. For example, when writing English in a cursive style, the letter "i" may take different forms depending on whether it appears at the beginning of a word, the end of a word, the middle of a word or in isolation. Languages such as Arabic, written in an Arabic script, are always cursive. Each letter has many different forms. UCS includes 731 Arabic form characters that decompose to just approximately 100 unique Arabic characters. However, the additional 731 Arabic characters are included so that text processing software may translate text from other character sets to UCS and back again without any loss of information crucial for non-Unicode software.
However, for UCS and Unicode in particular, the preferred approach is to always encode or map that letter to the same character no matter where it appears in a word. Then the distinct forms of each letter are determined by the font and text layout software methods. In this way, the internal memory for the characters remains identical regardless of where the character appears in a word. This greatly simplifies searching, sorting and other text processing operations.
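The effect on searching can be illustrated with the Arabic letter beh. The sketch below assumes the four presentation-form code points U+FE8F to U+FE92 (isolated, final, initial, medial) and uses compatibility normalization (NFKC) to map each back to the single nominal character U+0628. The helper name `to_nominal` is hypothetical.

```python
import unicodedata

# Compatibility (presentation form) characters for the letter beh.
BEH_FORMS = {
    "isolated": "\ufe8f",
    "final": "\ufe90",
    "initial": "\ufe91",
    "medial": "\ufe92",
}
BEH = "\u0628"   # ARABIC LETTER BEH, the preferred nominal character


def to_nominal(text: str) -> str:
    """Replace compatibility characters with their nominal equivalents."""
    return unicodedata.normalize("NFKC", text)


for form, ch in BEH_FORMS.items():
    print(form, f"U+{ord(ch):04X} -> U+{ord(to_nominal(ch)):04X}")

# Text stored with positional forms does not match a plain search term
# until it has been mapped back to nominal characters.
legacy_text = BEH_FORMS["initial"] + BEH_FORMS["medial"] + BEH_FORMS["final"]
print(BEH in legacy_text, BEH in to_nominal(legacy_text))
```

Once the text holds only nominal characters, choosing the right positional glyph is left to the font and layout software, as the paragraph above describes.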